Prosper Loan Analysis using R

By Xia Chen, 2018-11-10, Toronto

1.Introduction

Prosper is America’s first marketplace peer-to-peer lending platform, with over $12 billion in funded loans. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit score, etc. The explaination of all the variables in the data can be found here

In this project, an Exploratory Data Analysis (EDA) for the Prosper loan data set was conducted, in order to dig out some inspiring and valuable insights of the loan. Then Machine learning algorithms was applied to predict the default risk.

The outline of this project is:

  1. Introduction 1.1 Background 1.2 Install R packages 1.3 Data loanding and exploring
  2. Data Cleaning 2.1 Missing values 2.2 Drop columns and rows 2.3 Filling null values with column mean
  3. Exploratory Data Analysis 3.1 Univariate Plots Section 3.2 Bivariate Plots Section 3.3 Multivariate Plots Section
  4. Feature Engineering 4.1 Data pre-processing 4.2 Feature selection 4.3 Model validation and selection 4.4 Default prediction
  5. Conclusion and summary

1.1 Install R packages

In this analysis, I used 10 packages for data manipulation, data visulization and correlation analysis. The name are below: ggplot2, date, dplyr, gridExtra, scales, choroplethrMaps, chroplethr, GGally, knitr.

1.2 Load data

## There are  113937 observations of 81 variables.

1.The Prosper loan data contains 113,937 loans that were issued through 2005-2014, with 81 variables for each loan. 2. The target is the loan status which could help investors estimate the default risk of borrowers. 3. The interesting features that might affect the loan status could from numerical variables, including borrower rate, monthly income,loan amount, credit score, term of the loan, total prospers loans. The fetures could also from and categorical variables, including employment status, loan purpose, home ownership, borrower state.

2 Data processing and cleaning

2.1 Duplicates

## There are  0 dupplicates

2.2 Missing values

##                          ListingKey                       ListingNumber 
##                                   0                                   0 
##                 ListingCreationDate                         CreditGrade 
##                                   0                                   0 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               29084                               29084 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               29084                               29084 
##               ProsperRating..Alpha.                        ProsperScore 
##                                   0                               29084 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                   0                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7625                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                                   0 
##                    DateCreditPulled               CreditScoreRangeLower 
##                                   0                                 591 
##               CreditScoreRangeUpper             FirstRecordedCreditLine 
##                                 591                                   0 
##                  CurrentCreditLines                     OpenCreditLines 
##                                7604                                7604 
##          TotalCreditLinespast7years               OpenRevolvingAccounts 
##                                 697                                   0 
##         OpenRevolvingMonthlyPayment                InquiriesLast6Months 
##                                   0                                 697 
##                      TotalInquiries                CurrentDelinquencies 
##                                1159                                 697 
##                    AmountDelinquent             DelinquenciesLast7Years 
##                                7622                                 990 
##            PublicRecordsLast10Years           PublicRecordsLast12Months 
##                                 697                                7604 
##              RevolvingCreditBalance                 BankcardUtilization 
##                                7604                                7604 
##             AvailableBankcardCredit                         TotalTrades 
##                                7544                                7544 
##  TradesNeverDelinquent..percentage.             TradesOpenedLast6Months 
##                                7544                                7544 
##                   DebtToIncomeRatio                         IncomeRange 
##                                8554                                   0 
##                    IncomeVerifiable                 StatedMonthlyIncome 
##                                   0                                   0 
##                             LoanKey                   TotalProsperLoans 
##                                   0                               91852 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               91852                               91852 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               91852                               91852 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               91852                               91852 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               95009                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               96985                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                           MemberKey                  MonthlyLoanPayment 
##                                   0                                   0 
##                 LP_CustomerPayments        LP_CustomerPrincipalPayments 
##                                   0                                   0 
##                  LP_InterestandFees                      LP_ServiceFees 
##                                   0                                   0 
##                   LP_CollectionFees               LP_GrossPrincipalLoss 
##                                   0                                   0 
##                 LP_NetPrincipalLoss     LP_NonPrincipalRecoverypayments 
##                                   0                                   0 
##                       PercentFunded                     Recommendations 
##                                   0                                   0 
##          InvestmentFromFriendsCount         InvestmentFromFriendsAmount 
##                                   0                                   0 
##                           Investors 
##                                   0
  • There are 9 columns with over 90,000 missing values; some columns are either irrelevant or beyound the scope of this analysis. So they need to be removed.

2.3 Drop some irrelevant columns

2.4 Drop some rows with missing values

##                ListingCreationDate                               Term 
##                                  0                                  0 
##                         LoanStatus                       BorrowerRate 
##                                  0                                  0 
##          ListingCategory..numeric.                      BorrowerState 
##                                  0                                  0 
##                   EmploymentStatus           EmploymentStatusDuration 
##                                  0                                  0 
##                IsBorrowerHomeowner              CreditScoreRangeLower 
##                                  0                                  0 
##              CreditScoreRangeUpper                    OpenCreditLines 
##                                  0                                  0 
##              OpenRevolvingAccounts        OpenRevolvingMonthlyPayment 
##                                  0                                  0 
##               InquiriesLast6Months                     TotalInquiries 
##                                  0                                  0 
##               CurrentDelinquencies                   AmountDelinquent 
##                                  0                                  0 
##            DelinquenciesLast7Years           PublicRecordsLast10Years 
##                                  0                                  0 
##          PublicRecordsLast12Months             RevolvingCreditBalance 
##                                  0                                  0 
##                BankcardUtilization            AvailableBankcardCredit 
##                                  0                                  0 
##                        TotalTrades TradesNeverDelinquent..percentage. 
##                                  0                                  0 
##            TradesOpenedLast6Months                  DebtToIncomeRatio 
##                                  0                               8404 
##                StatedMonthlyIncome                 LoanOriginalAmount 
##                                  0                                  0 
##                 MonthlyLoanPayment                    Recommendations 
##                                  0                                  0 
##                          Investors 
##                                  0
  • After dropping the columns, several other columns have over 7544 missing values. So I removed them.

2.5 Filling missing values with mean

  • Now the data is pretty clean, but there is still 8404 missing values for debt to income ratio, and this is a very important variable, so I will fill the missing values with mean values.

2.6 Tidy column names and variables

##                        Term                  LoanStatus 
##                           0                           0 
##                BorrowerRate             ListingCategory 
##                           0                           0 
##               BorrowerState            EmploymentStatus 
##                           0                           0 
##    EmploymentStatusDuration         IsBorrowerHomeowner 
##                           0                           0 
##       CreditScoreRangeLower       CreditScoreRangeUpper 
##                           0                           0 
##             OpenCreditLines       OpenRevolvingAccounts 
##                           0                           0 
## OpenRevolvingMonthlyPayment        InquiriesLast6Months 
##                           0                           0 
##              TotalInquiries        CurrentDelinquencies 
##                           0                           0 
##            AmountDelinquent     DelinquenciesLast7Years 
##                           0                           0 
##    PublicRecordsLast10Years   PublicRecordsLast12Months 
##                           0                           0 
##      RevolvingCreditBalance         BankcardUtilization 
##                           0                           0 
##     AvailableBankcardCredit                 TotalTrades 
##                           0                           0 
##       TradesNeverDelinquent     TradesOpenedLast6Months 
##                           0                           0 
##           DebtToIncomeRatio         StatedMonthlyIncome 
##                           0                           0 
##          LoanOriginalAmount          MonthlyLoanPayment 
##                           0                           0 
##             Recommendations                   Investors 
##                           0                           0 
##            LoanCreationDate 
##                           0
  • After cleaning up the missing values, there are some column names and date type nee to tidy up.

2.7 Check data type

## There are  106290 observations of 33 variables.
  • After data cleaning step, now There are 106290 observations of 33 variables.

3. Exploratory Data Analysis

3.1 Univariate Plots Section

In this section, I want to start exploring the data by visulization The fetures that I am interested to investigate are below:

3.1.1 Loan Status

In risk management one important job is to build a predictive model to predict whether the loan will be default or not. So the loan status wil be the target variable. However, there are 12 loan status. But we need to label them as default or non defaul loans. ‘delinquent’, ‘default’ and ‘charge off’ are terms helping investor to estimate whether or not they can expect to collect on the outstanding debt at all. According to the definition of delinquencies from badcredit.org, an account will be considered severely delinquent if no payment has been received 60 days beyond the due date.Those delinquent accounts may have higher risk of default.

3.1.2 Define delinquent borrower

So I created a new variable ‘DelinquentBorrowers’ using ‘0’ to represent none delinquent borrowers and ‘1’ to represent delinquent borroweres, including loans that are noted default, charged off, and past due over 60 days. According to LendingClub, In general, a note goes into default status when it is 121 or more days past due. When a note is in default status, Charge Off occurs no later than 150 days past due (i.e. No later than 30 days after the Default status is reached) when there is no reasonable expectation of sufficient payment to prevent the charge off.

  • There is a high proportion of borrowers who are currently delinquent at least once. But the table summarized that only 13% of borrowers have severe delinquent records.
## # A tibble: 2 x 3
##   DelinquentBorrowers counts Percentage
##                 <dbl>  <int>      <dbl>
## 1                   0  91736       86.3
## 2                   1  14554       13.7

3.1.3 Histogram plot

  • There are four important features that I believe will tell the borrower’s ability to pay back the loan on time. They are the loan amount, monthly income, interest rate, and credit score. This histogram plots above give me an idea that most loans are small loans,so it make sense the interest rate might be higher. Indeed, 18.4% interest rate is quite high. Most borrowers’ are meddle class with monthly income around 5000 dollars, and their credit score is good, because the median value is 700.

3.1.4 Numeric univariate plots

3.1.5 Categorical univariables

The loan categories are give as numerical variables, in order to investigate the correlation of loan purpose with other features, the ‘ListingCategory’ column was coverted into categorical variable.

  • From the histogram plots for categorical variables, we get to know that there is high volumn of loans are for the purpose of debt consolidation. So later I need to investigate their creditline and other banck credit utilization informatin.

3.2 Bivariate Plots Section

3.2.1 Correlation heatmap

  • 14 variables are selected to investigate the correlation between those variables. Borrower rate has some correlation with employment status duration, bankcard utilization, bank card utilization, and monthly income.

3.2.2 Employment Status vs.Monthly income

-The monthly income is related to employment status, I cut the montly income into 6 levels and created a new variable MonthlyIncome.bucket. Interestingly, not employed does not mean low monthly income, employed and self-employed people also have very high monthly income.

3.2.3 Geographical distribution of loans

  • From a geographical perspective California, Texas, New York, Florida and Illinois have the largest dollar amounts and volumes of loans. The delinquent loans is geographically consistent with the high volume and large amount of the loan. But the average loan amount is higher in states Alaska and South Carolina, I think this might be because of the lower level of pupulation in these two states.

3.3 Multivariate Plots Section

    1. Borrower rate is related to prosper score, higher borrower rate leads to lower Prosper score. Delinquent borrowers might due to higher borrower rate.
    1. Deliquent borrowers are small loan borrowers. Also, deliquent borrowers tend to have higher Debt-to-income ratio.

3.4 Reflection

  1. The prospser loans dataset contains over 100k observations with 81 variables spanning across 10 years.The first step before conducting any data analysis is to understand the variables, terminology and general domain knowledge of financial peer-to-peer lending.Second, it is very important to determine which variables to analyze and stick to those variables without drifting too far off.Also there are so many missing values and none specific observations, cleaning is needed for looking into the relationship between some variables.
  1. For loan data, I believe that default risk analysis is a key component to help investors to decide if they could collect the full loan.So any features that could lead to default are worth investigating, In my analysis, I found that borrower rate from investor’s aspect and Prosper score from borrower’s aspect are strongly relatted to deliquencies. This brought me to investigate more features that could possibly affect borrower rate and prosper score, including monthly income, borrower region, loan purpose, employment status, credit card utilization, available credit, loan amount,home ownership, and debt to income ratio.

3.5 Limitations

The Exploratory Data Analysis strategy is a good way to find some insights of the date through interesting visualizations. However, EDA works better for data set with limited variables. For the Prosper Loan data with 81 variables, it is really time consuming to find out the most intesting features and investigate the correlations between them. In the future,an useful method to investigate this data could be building predictive models using machine learning.

4 Default prediction using machine learning

4.1 Data-preprocessing

4.1.1 Create dummy variables

4.1.2 Remove Highly Correlated Variables

  • Since the problem is a classification problem, I choose logistic regression, random forest and Adaboosting.

4.2 Model validation and selection

4.2.1 Model comparison and evaluation by running K-Fold cross validation

  • Since this project is a binnary classification problem, three algorithms were selected to compare: Logistic Regression, Naive Bayes, and Random Forest. The accuracy of those models were compared by running K-fold cross validation.
  • From the plot above, random forest give the best accuracy, which is 0.87.

4.2.2 Parameter tuning

  • In this section, two very important parameters mtry and ntree are tuned.
  • The error rate is lowestat 12.39% s when mtry =7.
  • OOB is bottoming out at approximately 1000 trees. So ntree=1000 would be good fit

4.2.3 Final Random Forest Classification

  • With the tunned parameters, the accuracy increased to 100%.

5 Conclusion

  1. The process of analyzing Prosper loan data is a good experience in data cleanning, visulizaion, and feature engineering.
  2. The advantages of random forest modeling include telling people about the importance of each variables.
  3. Some other features not included it the dataset could be also very important, such as the borrower’s age, and gender.
  4. This project could go further by carefully selecting the features. Future work could be predicting the loss and profit, predicting the interest rate to minimize the default, etc.

6 References